AITopics

2607.00149

Country: North America > United States (0.67)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)

arXiv.org Machine Learning, Jul-2-2026

Homogenization of $\ell_2$-Adversarial Training in High-Dimensions: Exact Dynamics under Stochastic Gradient Descent

Sabelli, Fabrizzio

We develop a framework for analyzing the learning dynamics of $\ell_2$-adversarial training of single-index models on Gaussian mixtures in the high-dimensional limit under streaming stochastic gradient descent (SGD). We derive deterministic equivalents for a broad class of statistics of the SGD iterates, including the adversarial risk and distance to adversarial optimality, in terms of the solution to a system of ODEs. We use them to study two idealized learning rate schedules: the Polyak stepsize and exact line search. In the case of $\ell_2$-adversarial least squares with a single class, we show that, unlike noiseless standard least squares, no constant learning rate guarantees monotone descent of SGD towards a minimizer of the adversarial risk. We identify anisotropic covariance and a mismatch in ridge parameters as the main sources of suboptimality of exact line search relative to the Polyak stepsize. We also introduce a stochastic differential equation (SDE), called adversarial homogenized SGD, that captures the evolution of statistics of the iterates of SGD. For $\ell_2$-adversarial least squares, using this SDE, we show the evolution of the risk is equivalent, up to dimension-free constants, to that of SGD on standard least squares with an adaptive learning rate and adaptive $\ell_2$-regularization. When the dynamics converge, the limiting adversarial risk and SGD iterate are determined by a fixed-point equation, with the limiting iterate being equivalent to the solution of a ridge regression problem whose regularization parameter is the limiting effective regularization of SGD.
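As a rough illustration of the single-class $\ell_2$-adversarial least squares setting mentioned above, the sketch below uses the standard closed form $\max_{\|\delta\|_2 \le \varepsilon} (\langle \theta, x + \delta \rangle - y)^2 = (|\langle \theta, x \rangle - y| + \varepsilon \|\theta\|_2)^2$ and runs streaming SGD with a constant learning rate. The isotropic Gaussian data, noiseless labels, and all names (`adv_loss`, `adv_grad`, `streaming_sgd`, `eps`, `lr`) are assumptions made for illustration, not the paper's setup or notation.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def adv_loss(theta, x, y, eps):
    """Worst-case squared error over perturbations with ||delta||_2 <= eps."""
    return (abs(dot(theta, x) - y) + eps * math.sqrt(dot(theta, theta))) ** 2


def adv_grad(theta, x, y, eps):
    """(Sub)gradient of adv_loss in theta; the norm term is dropped at theta = 0."""
    r = dot(theta, x) - y
    norm = math.sqrt(dot(theta, theta))
    sign = 1.0 if r >= 0 else -1.0
    scale = 2.0 * (abs(r) + eps * norm)
    return [
        scale * (sign * xi + (eps * ti / norm if norm > 0 else 0.0))
        for ti, xi in zip(theta, x)
    ]


def streaming_sgd(target, steps=2000, lr=0.01, eps=0.1, seed=0):
    """One fresh Gaussian sample per step, noiseless labels y = <target, x>.

    The data model and constant learning rate are illustrative assumptions.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(target)
    for _ in range(steps):
        x = [rng.gauss(0.0, 1.0) for _ in target]
        g = adv_grad(theta, x, dot(target, x), eps)
        theta = [ti - lr * gi for ti, gi in zip(theta, g)]
    return theta
```

Setting `eps = 0` recovers ordinary least squares. For `eps > 0` the extra term $\varepsilon \|\theta\|_2$ behaves like a data-dependent ridge penalty, which is consistent with the abstract's description of adaptive $\ell_2$-regularization.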

artificial intelligence, def, machine learning, (17 more...)

2607.00207

Country:

North America > United States (0.45)
North America > Canada (0.27)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Smart, Matthew, Ganguly, Soumya, Metya, Nilava, Morozov, Alexandre V., Sengupta, Anirvan M.

Attention as In-Context Empirical Bayes: A Two-Stage View via Particle Dynamics

arXiv.org Machine Learning, May-29-2026

We study minimal attention-only transformers under all-token corruption and show they admit a two-stage empirical Bayes interpretation. A single attention step computes a kernel-weighted posterior mean with respect to the empirical distribution defined by the context. Depth refines this distribution through particle dynamics (Stage 1), while a long-range skip-connection carries the noisy input as a query for posterior inference (Stage 2), revealing distinct statistical roles for depth and attention residuals. The framework isolates a minimal setting in which the context itself induces a depth-dependent energy landscape governing in-context inference. We show that effective denoising can emerge without an explicit noise schedule: a fixed kernel bandwidth and finite integration horizon suffice, yielding a principled depth-noise relationship. We further establish a posterior-mean recovery guarantee for a class of well-behaved priors, where the empirical estimator converges to the Bayes-optimal predictor under asymptotic conditions. Connecting these dynamics to reverse-diffusion limits, our results provide a statistical interpretation of attention as in-context inference via sample-based posterior estimation, without explicit density modeling.
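The claim that a single attention step computes a kernel-weighted posterior mean can be checked in a toy case. The sketch below assumes a Gaussian kernel and a uniform prior on the context tokens (here called `particles`); neither the kernel, the bandwidth `sigma`, nor the function names come from the paper. Under those assumptions, and when all keys share the same norm, dot-product softmax attention with inverse temperature `beta = 1 / sigma**2` coincides with the posterior mean of a clean token given a noisy query.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return [wi / z for wi in w]


def posterior_mean(y, particles, sigma):
    """E[x | y] for x uniform on `particles` and y = x + N(0, sigma^2 I)."""
    logits = [
        -sum((yj - pj) ** 2 for yj, pj in zip(y, p)) / (2.0 * sigma**2)
        for p in particles
    ]
    w = softmax(logits)
    return [sum(wi * p[j] for wi, p in zip(w, particles)) for j in range(len(y))]


def attention(query, keys, values, beta):
    """Single-query dot-product softmax attention with inverse temperature beta."""
    w = softmax([beta * sum(q * k for q, k in zip(query, key)) for key in keys])
    return [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]
```

The equivalence follows from expanding $-\|y - x_i\|^2 / (2\sigma^2)$: the $\|y\|^2$ term cancels in the softmax, and the $\|x_i\|^2$ term cancels whenever the keys have equal norm, leaving $\langle y, x_i \rangle / \sigma^2$.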

approximation, artificial intelligence, machine learning, (17 more...)

2605.29351

Country:

North America > United States (0.46)
Europe (0.45)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)

Malaxechebarría, Begoña García, Paquette, Courtney, Fazel, Maryam, Drusvyatskiy, Dmitriy

High-dimensional Limit of SGD for Diagonal Linear Networks

arXiv.org Machine Learning, May-19-2026

Understanding the behavior of stochastic gradient methods is a central problem in modern machine learning. Recent work has highlighted diagonal linear networks as a simplified yet expressive setting for analyzing the optimization and generalization properties of neural models. In this work, we show that in the high-dimensional regime, stochastic gradient descent on diagonal linear networks is well-approximated by continuous dynamics governed by a stochastic differential equation (SDE), which explicitly decouples the drift from the gradient noise. We further derive a deterministic partial differential equation whose solution propagates the relevant state of the iterates and characterizes the time evolution of a broad class of observable statistics, including the risk, curvature, and other metrics for optimality. Finally, we show that, under a suitable parametrization, the stochastic dynamics are globally well posed and converge exponentially fast to zero risk with high probability, yielding a fully explicit non-asymptotic description of their long-time behavior. Numerical simulations corroborate our theoretical findings.
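For readers unfamiliar with the model, a minimal sketch of streaming SGD on a diagonal linear network follows. It assumes the two-layer parametrization $w = u \odot v$, isotropic Gaussian inputs, noiseless labels, and the initialization `u = alpha`, `v = 0`; the abstract does not state which parametrization or data model the paper uses, so all of these, along with the names `dln_sgd` and `excess_risk`, are illustrative choices.

```python
import random


def dln_sgd(target, steps=3000, lr=0.01, alpha=1.0, seed=0):
    """Streaming SGD on 0.5 * (<u * v, x> - y)^2 with y = <target, x>."""
    rng = random.Random(seed)
    d = len(target)
    u = [alpha] * d  # first layer starts at alpha (assumed initialization)
    v = [0.0] * d    # second layer starts at zero, so u * v = 0 initially
    for _ in range(steps):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = sum(t * xi for t, xi in zip(target, x))
        r = sum(ui * vi * xi for ui, vi, xi in zip(u, v, x)) - y
        # Chain rule through the elementwise product w = u * v;
        # both layers are updated from the same pre-step values.
        u, v = (
            [ui - lr * r * xi * vi for ui, vi, xi in zip(u, v, x)],
            [vi - lr * r * xi * ui for ui, vi, xi in zip(u, v, x)],
        )
    return u, v


def excess_risk(u, v, target):
    """For x ~ N(0, I) and noiseless labels the risk is 0.5 * ||u * v - target||^2."""
    return 0.5 * sum((ui * vi - t) ** 2 for ui, vi, t in zip(u, v, target))
```

Tracking `excess_risk` along the trajectory gives one example of the "observable statistics" the abstract refers to. In this noiseless toy case the gradient noise vanishes at the solution, so the risk can decay to zero.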

artificial intelligence, deep learning, machine learning, (15 more...)

2605.17177

Country: North America > United States > New York (0.27)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

Agazzi, Andrea, Bruno, Giuseppe, García, Eloy Mosig, Saviozzi, Samuele, Romito, Marco

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

arXiv.org Machine Learning, Apr-30-2026

The transformer architecture [52], which underlies present-day Large Language Models, has been one of the main drivers of recent advances in machine learning and artificial intelligence. At each layer, the hidden state of the network is updated by sequentially applying two distinct operations: attention modules [3], which capture long-range interactions in the input sequence, and classical MultiLayer Perceptrons (MLPs), acting separately on each element of that sequence. Despite their empirical success, the mechanisms governing information propagation through depth, and the way attention and MLP blocks jointly shape internal representations, remain only partially understood from a theoretical viewpoint. Recent progress has come from viewing transformers in suitable scaling limits as deterministic mean-field interacting particle systems modeling the evolution of $N$ tokens through the layers of the neural network architecture (the so-called residual stream dynamics); see, among others, [46, 26, 27, 45]. In these descriptions, depth plays the role of a continuous time variable, and, in the large-context regime ($N \to \infty$), the evolution of token representations is encoded by a PDE for their empirical distribution. This viewpoint is closely connected to the literature on scaling laws, where the effect of various scaling exponents controlling the relative size of the network's hyperparameters (e.g., depth, width, context length) on the effective dynamics of the model

lemma 2, machine learning, natural language, (19 more...)

2604.26898

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.54)

Neural Information Processing Systems, Feb-18-2026, 11:31:03 GMT

e379877d7880bb1f80c82a9f1c58e6e8-Paper-Conference.pdf

artificial intelligence, assumption, machine learning, (19 more...)

Country:

North America > United States > California > Los Angeles County > Pasadena (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Greece (0.04)

Genre: Research Report > Experimental Study (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Neural Information Processing Systems, Feb-15-2026, 23:58:21 GMT

The Poisson Midpoint Method for Langevin Dynamics: Provably Efficient Discretization for Diffusion Models

Langevin Monte Carlo (LMC) can suffer from slow convergence, requiring a large number of small steps to obtain good-quality samples.
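The step-size trade-off can be seen in a one-dimensional toy example. The sketch below implements the plain (unadjusted) Langevin Monte Carlo update $x_{k+1} = x_k - \eta \nabla U(x_k) + \sqrt{2\eta}\,\xi_k$, not the Poisson midpoint method of the paper. The target $U(x) = x^2/2$ and the names `lmc_chain` and `stationary_variance` are assumptions chosen so that the discretization bias has a closed form.

```python
import math
import random


def lmc_chain(grad_u, x0, step, n_steps, seed=0):
    """Unadjusted Langevin: x <- x - step * grad_u(x) + sqrt(2 * step) * N(0, 1)."""
    rng = random.Random(seed)
    x, xs = x0, []
    for _ in range(n_steps):
        x = x - step * grad_u(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        xs.append(x)
    return xs


def stationary_variance(step):
    """Stationary variance of the chain for U(x) = x^2 / 2, valid for 0 < step < 2.

    The recursion x' = (1 - step) * x + sqrt(2 * step) * xi has variance v solving
    v = (1 - step)^2 * v + 2 * step, i.e. v = 1 / (1 - step / 2).
    """
    return 1.0 / (1.0 - step / 2.0)
```

The target variance is 1, so a step of 0.5 inflates the stationary variance to 4/3, while shrinking the step reduces the bias at the cost of slower mixing. This is the tension that improved discretizations aim to relieve.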

artificial intelligence, deep learning, machine learning, (17 more...)

Country:

Asia > Middle East > Jordan (0.04)
North America > United States (0.04)
Africa > Rwanda > Kigali > Kigali (0.04)

Genre:

Research Report > Experimental Study (0.92)
Workflow (0.67)

Industry: Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

Neural Information Processing Systems, Feb-10-2026, 11:13:41 GMT

equal to z = z 1 to normalize andhea Student's t-distrib p(z) = 8

Let $w = (1.5, 0, \ldots, 0)$ ... $N(0, 0.5)$ ... Denoting (25) ... ution of ... on $w$ same as that of (26) ... ey observation is that ..., Z1/2w k are ... To see why this is the case, we can vectorize each term: First, let' ... Lemma For any $F: \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}_+$, define ... problem 1, ..., k, as: = su ... Next, let' ... 2021) Proving the ... 31 Lf (w, b) C(w)2 n (49) to be the left ... (47) (where the ( (w), b) is used ... depends w only through (w)).

artificial intelligence, machine learning, zhouetal, (19 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing Systems, Feb-9-2026, 09:24:36 GMT

A Organization of the Appendices

In the Appendix, we give proofs of all results from the main text. We say a function $f: \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ is $M$-Lipschitz if for any $y \in \mathcal{Y}$ and $\hat{y}$ ... We can also define the Moreau envelope of a function $f: \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ by ... The proof of all results in this section can be straightforwardly extended to these settings. ... (Boyd et al. 2004; Bauschke, Combettes, et al. 2011; Rockafellar 1970), but is also useful and ... Interestingly, there is a similar equivalent characterization for Lipschitz functions as well. Finally, we show that any smooth loss is square-root-Lipschitz. ... Lipschitz losses is more general than the class of smooth losses studied in Srebro et al. 2010.

artificial intelligence, machine learning, probability, (18 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.46)